Automatically enriching spoken corpora with syntactic information for linguistic studies

Authors

Alexis Nasr

Frédéric Béchet

Benoît Favre

Thierry Bazillon

José Deulofeu

André Valli

Abstract

Syntactic parsing of speech transcriptions is hampered by disfluencies, which break the syntactic structure of utterances. In this paper we propose two solutions to this problem. The first relies on a disfluency predictor that detects disfluencies and removes them prior to parsing. The second integrates disfluencies into the syntactic structure of the utterances and trains a disfluency-aware parser.
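As a rough illustration of the first strategy (detect disfluencies, remove them, then parse), the sketch below uses a toy rule-based filter as a stand-in for a trained predictor. The filled-pause list, the function names, and the repetition rule are assumptions made for illustration only; they are not the authors' predictor.

```python
from typing import List

# Hypothetical inventory of filled pauses; a real predictor would be
# learned from disfluency-annotated transcriptions.
FILLED_PAUSES = {"uh", "um", "euh", "hum"}


def flag_disfluencies(tokens: List[str]) -> List[bool]:
    """Return one flag per token: True if the token looks disfluent.

    Two toy rules (assumptions, not the paper's model):
      1. the token is a filled pause;
      2. the token is immediately repeated, in which case the earlier
         copy is flagged and the last copy is kept.
    """
    flags = []
    for i, tok in enumerate(tokens):
        low = tok.lower()
        is_filler = low in FILLED_PAUSES
        is_repeat = i + 1 < len(tokens) and tokens[i + 1].lower() == low
        flags.append(is_filler or is_repeat)
    return flags


def remove_disfluencies(tokens: List[str]) -> List[str]:
    """Drop every flagged token so that a parser sees a cleaner utterance."""
    flags = flag_disfluencies(tokens)
    return [tok for tok, flagged in zip(tokens, flags) if not flagged]


cleaned = remove_disfluencies("we we uh went home".split())
print(cleaned)  # ['we', 'went', 'home']
```

In such a pipeline the cleaned token list would be handed to an ordinary parser. The second strategy described in the abstract would instead keep all tokens and let the parser attach the disfluent ones within the syntactic structure.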


Similar resources

Detecting Annotation Errors in Spoken Language Corpora

Consistency of corpus annotation is an essential property for the many uses of annotated corpora in computational and theoretical linguistics. While some research addresses the detection of inconsistencies in part-of-speech and other positional annotation (van Halteren, 2000; Eskin, 2000; Dickinson and Meurers, 2003a), more recently work has also started to address errors in syntactic and other...


Comparative study of oral and written French automatically tagged with morpho-syntactic information

In this paper, we investigate automatic tagging of French corpora and compare morpho-syntactic properties of spoken and written language on corpora from different sources. Morpho-syntactic properties are first described according to the distribution of the 8 main POS in five corpora of about 1 million words each. The automatic tagging was made with about a hundred tags and we will describe the ...


How Spoken Language Corpora Can Refine Current Speech Motor Training Methodologies

The growing availability of spoken language corpora presents new opportunities for enriching the methodologies of speech and language therapy. In this paper, we present a novel approach for constructing speech motor exercises, based on linguistic knowledge extracted from spoken language corpora. In our study with the Dutch Spoken Corpus, syllabic inventories were obtained by means of automatic ...


Integrating Linguistic and Signal Knowledge in a Morpheme Based Speech Corpus Annotation Tool

As more and more speech systems require high-level linguistic knowledge to accommodate various levels of applications, corpora that are tagged with high-level linguistic annotations as well as signal-level annotations are highly recommended for development of today's speech systems. Among the high-level linguistic annotations, POS (part-of-speech) tag annotations are indispensable in speech cor...


Towards an integrated representation of multiple layers of linguistic annotation in multilingual corpora

There has been an increasing interest in recent years in the enrichment of natural language corpora in terms of annotation with explicit linguistic information. This interest manifests itself most prominently in two areas of linguistics: corpus linguistics and computational linguistics. For corpus linguistics, the long standing practice has been to work on raw, i.e., unannotated text. While raw...

Publication date: 2014
